AITopics | voice activity detection

voice activity detection

HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

Nejad, Mahsa Ghazvini, Asl, Hamed Jafarzadeh, Edraki, Amin, Sadeghi, Mohammadreza, Asgharian, Masoud, Yu, Yuanhao, Nia, Vahid Partovi

arXiv.org Artificial Intelligence, Oct-16-2025

Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker by incorporating speaker embeddings from enrollment utterances. Unlike existing methods that require architectural changes, such as FiLM layers, our approach employs a hypernetwork to modify the weights of a few selected layers within a standard voice activity detection (VAD) model. This enables speaker conditioning without changing the VAD architecture, allowing the same VAD model to adapt to different speakers by updating only a small subset of the layers. We propose HyWA-PVAD, a hypernetwork weight adaptation method, and evaluate it against multiple baseline conditioning techniques. Our comparison shows consistent improvements in PVAD performance. HyWA also offers practical advantages for deployment by preserving the core VAD architecture. Our new approach improves on current conditioning techniques in two ways: (i) it increases the mean average precision, and (ii) it simplifies deployment by reusing the same VAD architecture.
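
As a rough illustration of the weight-adaptation idea described in this abstract, the sketch below shows a linear layer whose weights are shifted by an offset computed from a speaker embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name `HyperAdaptedLayer`, the single linear hypernetwork, and the additive offset are all hypothetical, since the abstract does not describe the hypernetwork's design.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class HyperAdaptedLayer:
    """Linear layer whose weights are offset by a tiny hypernetwork.

    Assumption: the hypernetwork is one linear map from the speaker
    embedding to a flattened weight offset. The real design may differ.
    """

    def __init__(self, in_dim, out_dim, emb_dim, seed=0):
        rng = random.Random(seed)
        self.in_dim, self.out_dim = in_dim, out_dim
        # Base weights of the unmodified VAD layer.
        self.base = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                     for _ in range(out_dim)]
        # Hypernetwork weights: embedding -> (out_dim * in_dim) offsets.
        self.hyper = [[rng.uniform(-0.1, 0.1) for _ in range(emb_dim)]
                      for _ in range(out_dim * in_dim)]

    def adapted_weights(self, speaker_emb):
        """Return base weights plus the speaker-dependent offset."""
        delta = matvec(self.hyper, speaker_emb)
        return [[self.base[o][i] + delta[o * self.in_dim + i]
                 for i in range(self.in_dim)]
                for o in range(self.out_dim)]

    def forward(self, features, speaker_emb):
        """Apply the speaker-adapted layer to one feature frame."""
        return matvec(self.adapted_weights(speaker_emb), features)


layer = HyperAdaptedLayer(in_dim=4, out_dim=3, emb_dim=2)
frame = [1.0, 0.5, -0.5, 2.0]
print(layer.forward(frame, [1.0, 0.0]))  # output for one enrolled speaker
print(layer.forward(frame, [0.0, 1.0]))  # same layer, different speaker
```

The layer's shape and forward pass stay the same for every speaker; only the weight values change, which is the deployment property the abstract emphasizes.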

ad model, artificial intelligence, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2510.12947

Country: North America > Canada > Quebec > Montreal (0.28)

Genre: Research Report (0.65)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.95)

Tiny Noise-Robust Voice Activity Detector for Voice Assistants

Asl, Hamed Jafarzadeh, Nejad, Mahsa Ghazvini, Edraki, Amin, Asgharian, Masoud, Nia, Vahid Partovi

arXiv.org Artificial Intelligence, Jul-31-2025

Voice Activity Detection (VAD) in the presence of background noise remains a challenging problem in speech processing. Accurate VAD is essential in automatic speech recognition, voice-to-text, conversational agents, etc., where noise can severely degrade performance. A modern application is the voice assistant, especially when mounted on Artificial Intelligence of Things (AIoT) devices such as cell phones, smart glasses, earbuds, etc., where the voice signal includes background noise. VAD modules must therefore remain lightweight because of on-device limitations. Existing models often struggle with low signal-to-noise ratios across diverse acoustic environments. A simple VAD often detects human voice in a clean environment but struggles to detect it in noisy conditions. We propose a noise-robust VAD that comprises a lightweight VAD with added data pre-processing and post-processing modules to handle the background noise. This approach significantly enhances VAD accuracy in noisy environments and requires neither a larger model nor fine-tuning. Experimental results demonstrate that our approach achieves a notable improvement compared to baselines, particularly in environments with high background noise interference. The modified VAD additionally improves clean speech detection.
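
The abstract does not say which pre-processing or post-processing modules are used. As one common example of VAD post-processing (an assumption chosen for illustration, not the authors' method), a hangover scheme keeps the decision at "speech" for a few frames after the last confident detection, which suppresses brief dropouts. The function name `hangover_smooth` and its default parameters are hypothetical.

```python
def hangover_smooth(probs, threshold=0.5, hang_frames=2):
    """Convert per-frame speech probabilities into 0/1 decisions.

    A frame is speech if its probability reaches the threshold, or if it
    falls within hang_frames frames after the last such frame.
    """
    decisions = []
    remaining = 0  # hangover frames still available
    for p in probs:
        if p >= threshold:
            remaining = hang_frames
            decisions.append(1)
        elif remaining > 0:
            remaining -= 1
            decisions.append(1)
        else:
            decisions.append(0)
    return decisions


probs = [0.1, 0.9, 0.2, 0.3, 0.1, 0.1, 0.8]
print(hangover_smooth(probs))  # → [0, 1, 1, 1, 0, 0, 1]
```

Because this runs on the output of an existing detector, it needs no retraining and adds almost no compute, which fits the lightweight constraint the abstract describes.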

artificial intelligence, machine learning, speech, (20 more...)

arXiv.org Artificial Intelligence

2507.22157

Country: North America > Canada > Quebec > Montreal (0.29)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)

Exploring Speaker Diarization with Mixture of Experts

Yang, Gaobin, He, Maokui, Niu, Shutong, Wang, Ruoyu, Chen, Hang, Du, Jun

arXiv.org Artificial Intelligence, Jun-18-2025

In this paper, we propose a novel neural speaker diarization system using memory-aware multi-speaker embedding with sequence-to-sequence architecture (NSD-MS2S), which integrates a memory-aware multi-speaker embedding module with a sequence-to-sequence architecture. The system leverages a memory module to enhance speaker embeddings and employs a Seq2Seq framework to efficiently map acoustic features to speaker labels. Additionally, we explore the application of mixture of experts in speaker diarization, and introduce a Shared and Soft Mixture of Experts (SS-MoE) module to further mitigate model bias and enhance performance. Incorporating SS-MoE leads to the extended model NSD-MS2S-SSMoE. Experiments on multiple complex acoustic datasets, including the CHiME-6, DiPCo, Mixer 6 and DIHARD-III evaluation sets, demonstrate meaningful improvements in robustness and generalization. The proposed methods achieve state-of-the-art results, showcasing their effectiveness in challenging real-world scenarios. Speaker diarization, which aims to determine the temporal boundaries of individual speakers within an audio stream and assign appropriate speaker identities, addresses the fundamental question of "who spoke when" [1]. It serves as a foundational component in numerous downstream speech-related tasks, including automatic meeting summarization, conversational analysis, and dialogue transcription [2].

artificial intelligence, machine learning, module, (17 more...)

arXiv.org Artificial Intelligence

2506.1475

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry:

Media (0.34)
Leisure & Entertainment (0.34)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)

Attention Is Not Always the Answer: Optimizing Voice Activity Detection with Simple Feature Fusion

Tripathi, Kumud, Kumar, Chowdam Venkata, Wasnik, Pankaj

arXiv.org Artificial Intelligence, Jun-3-2025

Voice Activity Detection (VAD) plays a key role in speech processing, often utilizing hand-crafted or neural features. This study examines the effectiveness of Mel-Frequency Cepstral Coefficients (MFCCs) and pre-trained model (PTM) features, including wav2vec 2.0, HuBERT, WavLM, UniSpeech, MMS, and Whisper. We propose FusionVAD, a unified framework that combines both feature types using three fusion strategies: concatenation, addition, and cross-attention (CA). Experimental results reveal that simple fusion techniques, particularly addition, outperform CA in both accuracy and efficiency. Fusion-based models consistently surpass single-feature models, highlighting the complementary nature of MFCCs and PTM features. Notably, our best-performing fusion model exceeds the state-of-the-art Pyannote across multiple datasets, achieving an absolute average improvement of 2.04%. These results confirm that simple feature fusion enhances VAD robustness while maintaining computational efficiency.
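
To make the two simpler fusion strategies concrete, here is a minimal per-frame sketch of concatenation and addition. It assumes each feature stream is first projected to a shared dimension before addition (the abstract does not say how dimensions are matched), and the function names and toy matrices are hypothetical. Cross-attention is omitted.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse_concat(mfcc_frame, ptm_frame):
    """Concatenation: stack both feature vectors end to end."""
    return list(mfcc_frame) + list(ptm_frame)


def fuse_add(mfcc_frame, ptm_frame, w_mfcc, w_ptm):
    """Addition: project both streams to a shared size, then sum.

    Assumption: linear projections w_mfcc and w_ptm map each stream to
    the same output dimension.
    """
    a = matvec(w_mfcc, mfcc_frame)
    b = matvec(w_ptm, ptm_frame)
    if len(a) != len(b):
        raise ValueError("projected features must have equal length")
    return [x + y for x, y in zip(a, b)]


mfcc = [1.0, 2.0]        # toy 2-dim MFCC frame
ptm = [1.0, 2.0, 3.0]    # toy 3-dim pre-trained model frame
w_mfcc = [[1.0, 0.0], [0.0, 1.0]]
w_ptm = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

print(fuse_concat(mfcc, ptm))            # → [1.0, 2.0, 1.0, 2.0, 3.0]
print(fuse_add(mfcc, ptm, w_mfcc, w_ptm))  # → [2.0, 5.0]
```

Addition keeps the fused dimension fixed regardless of how many streams are combined, while concatenation grows with each stream; this may partly explain the efficiency difference the abstract reports.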

artificial intelligence, machine learning, representation, (16 more...)

arXiv.org Artificial Intelligence

2506.01365

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Information Fusion (0.73)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)

Noise-Robust Target-Speaker Voice Activity Detection Through Self-Supervised Pretraining

Bovbjerg, Holger Severin, Østergaard, Jan, Jensen, Jesper, Tan, Zheng-Hua

arXiv.org Artificial Intelligence, Jan-6-2025

Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target-speaker in an audio frame. Recently, deep neural network-based models have shown good performance in this task. However, training these models requires extensive labelled data, which is costly and time-consuming to obtain, particularly if generalization to unseen environments is crucial. To mitigate this, we propose a causal, Self-Supervised Learning (SSL) pretraining framework, called Denoising Autoregressive Predictive Coding (DN-APC), to enhance TS-VAD performance in noisy conditions. We also explore various speaker conditioning methods and evaluate their performance under different noisy conditions. Our experiments show that DN-APC improves performance in noisy conditions, with a general improvement of approx. 2% in both seen and unseen noise. Additionally, we find that FiLM conditioning provides the best overall performance. Representation analysis via tSNE plots reveals robust initial representations of speech and non-speech from pretraining. This underscores the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.
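
FiLM (feature-wise linear modulation) conditions a network by scaling and shifting each hidden feature with values computed from a conditioning vector, here a speaker embedding. The sketch below shows that operation for a single frame. The linear maps `w_gamma` and `w_beta` and the toy values are assumptions for illustration; the paper's exact layer placement and parameterization are not given in the abstract.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def film(hidden, speaker_emb, w_gamma, w_beta):
    """Apply FiLM: out[i] = gamma[i] * hidden[i] + beta[i].

    gamma and beta are computed from the speaker embedding by two
    (assumed) linear maps with one row per hidden feature.
    """
    gamma = matvec(w_gamma, speaker_emb)
    beta = matvec(w_beta, speaker_emb)
    return [g * h + b for g, h, b in zip(gamma, hidden, beta)]


hidden = [1.0, 2.0]          # toy hidden activations for one frame
speaker_emb = [1.0, 0.0]     # toy enrollment embedding
w_gamma = [[2.0, 0.0], [0.5, 0.0]]
w_beta = [[0.0, 1.0], [1.0, 0.0]]

print(film(hidden, speaker_emb, w_gamma, w_beta))  # → [2.0, 2.0]
```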

artificial intelligence, information, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2501.03184

Country: Europe > Denmark (0.14)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)

Automatic Speech Recognition for Hindi

Saha, Anish, Ramakrishnan, A. G.

arXiv.org Artificial Intelligence, Jun-26-2024

Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.

activity detection, detection, transcript, (14 more...)

arXiv.org Artificial Intelligence

2406.18135

Country:

Europe > Austria > Vienna (0.14)
Asia > India > Karnataka > Bengaluru (0.14)
Asia > Indonesia > Bali (0.05)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.89)

Comparative Analysis of Personalized Voice Activity Detection Systems: Assessing Real-World Effectiveness

Kumar, Satyam, Buddi, Sai Srujana, Sarawgi, Utkarsh Oggy, Garg, Vineet, Ranjan, Shivesh, Rudovic, Ognjen, Abdelaziz, Ahmed Hussen, Adya, Saurabh

arXiv.org Artificial Intelligence, Jun-11-2024

Voice activity detection (VAD) is a critical component in various applications such as speech recognition, speech enhancement, and hands-free communication systems. With the increasing demand for personalized and context-aware technologies, the need for effective personalized VAD systems has become paramount. In this paper, we present a comparative analysis of Personalized Voice Activity Detection (PVAD) systems to assess their real-world effectiveness. We introduce a comprehensive approach to assess PVAD systems, incorporating various performance metrics such as frame-level and utterance-level error rates, detection latency and accuracy, alongside user-level analysis. Through extensive experimentation and evaluation, we provide a thorough understanding of the strengths and limitations of various PVAD variants. This paper advances the understanding of PVAD technology by offering insights into its efficacy and viability in practical applications using a comprehensive set of metrics.
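
Two of the metric families named in this abstract, frame-level error rate and detection latency, can be sketched as follows. The definitions here are simple assumptions for illustration (the paper's exact formulas are not given in the abstract): error rate is the fraction of mismatched frames, and latency is the number of frames between the first reference speech frame and the first predicted speech frame at or after it.

```python
def frame_error_rate(reference, predicted):
    """Fraction of frames where the 0/1 prediction differs from the label."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = sum(1 for r, p in zip(reference, predicted) if r != p)
    return errors / len(reference)


def detection_latency(reference, predicted):
    """Frames from speech onset in the reference to the first detection.

    Returns None if the reference has no speech or speech is never
    detected at or after the onset.
    """
    if 1 not in reference:
        return None
    onset = reference.index(1)
    for t in range(onset, len(predicted)):
        if predicted[t] == 1:
            return t - onset
    return None


reference = [0, 0, 1, 1, 1, 0]
predicted = [0, 0, 0, 1, 1, 1]
print(frame_error_rate(reference, predicted))   # 2 of 6 frames differ
print(detection_latency(reference, predicted))  # → 1
```

Multiplying the latency in frames by the frame hop (for example 10 ms, an assumed value) would convert it to time.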

activity detection, detection, voice activity detection, (13 more...)

arXiv.org Artificial Intelligence

2406.09443

Country: North America > United States > Texas > Travis County > Austin (0.04)

Genre:

Research Report > New Finding (0.46)
Research Report > Experimental Study (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.90)

A Real-Time Voice Activity Detection Based On Lightweight Neural

Jia, Jidong, Zhao, Pei, Wang, Di

arXiv.org Artificial Intelligence, May-26-2024

Voice activity detection (VAD) is the task of detecting speech in an audio stream, which is challenging due to numerous unseen noises and low signal-to-noise ratios in real environments. Recently, neural network-based VADs have alleviated the degradation of performance to some extent. However, the majority of existing studies have employed excessively large models and incorporated future context, while neglecting to evaluate the operational efficiency and latency of the models. In this paper, we propose a lightweight and real-time neural network called MagicNet, which utilizes causal and depthwise separable 1-D convolutions and GRU. Without relying on future features as input, our proposed model is compared with two state-of-the-art algorithms on synthesized in-domain and out-of-domain test datasets. The evaluation results demonstrate that MagicNet can achieve improved performance and robustness with fewer parameter costs.
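
The two convolution properties mentioned in this abstract can be illustrated with a small sketch. A causal 1-D convolution pads only on the left, so the output at time t depends on inputs up to t and never on future frames. A depthwise separable convolution splits the work into a per-channel (depthwise) filter followed by a 1x1 (pointwise) channel mix. The function names and toy values are hypothetical, and this is not MagicNet's actual layer configuration.

```python
def causal_depthwise_conv1d(x, kernels):
    """Filter each channel with its own kernel, using left padding only.

    x: list of channels, each a list of T samples.
    kernels: one kernel per channel; kernel[-1] weights the current
    sample and earlier entries weight past samples.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        padded = [0.0] * (k - 1) + list(channel)  # no look-ahead
        out.append([sum(kernel[j] * padded[t + j] for j in range(k))
                    for t in range(len(channel))])
    return out


def pointwise_conv1d(x, weights):
    """1x1 convolution: mix channels independently at each time step.

    weights[o][c] is the weight from input channel c to output channel o.
    """
    steps = len(x[0])
    return [[sum(w[c] * x[c][t] for c in range(len(x)))
             for t in range(steps)]
            for w in weights]


x = [[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]]   # 2 channels, 3 frames
kernels = [[1.0, 1.0], [0.0, 2.0]]       # one length-2 kernel per channel
depthwise = causal_depthwise_conv1d(x, kernels)
print(depthwise)                                  # → [[1.0, 3.0, 5.0], [2.0, 2.0, 2.0]]
print(pointwise_conv1d(depthwise, [[1.0, 1.0]]))  # → [[3.0, 5.0, 7.0]]
```

With C input channels, C output channels, and kernel size k, a standard convolution needs C*C*k weights while the separable form needs C*k + C*C, which is one source of the parameter savings such models aim for.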

activity detection, neural network, voice activity detection, (13 more...)

arXiv.org Artificial Intelligence

2405.16797

Country:

Asia > China > Shanghai > Shanghai (0.05)
Oceania > Australia > Queensland > Brisbane (0.04)
North America > United States > Oregon > Multnomah County > Portland (0.04)
(7 more...)

Genre: Research Report > New Finding (0.34)

Industry: Media (0.34)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

SECP: A Speech Enhancement-Based Curation Pipeline For Scalable Acquisition Of Clean Speech

Sabra, Adam, Wronka, Cyprian, Mao, Michelle, Hijazi, Samer

arXiv.org Artificial Intelligence, Feb-19-2024

As more speech technologies rely on a supervised deep learning approach with clean speech as the ground truth, a methodology to onboard said speech at scale is needed. However, this approach needs to minimize the dependency on human listening and annotation, only requiring a human-in-the-loop when needed. In this paper, we address this issue by outlining the Speech Enhancement-based Curation Pipeline (SECP), which serves as a framework to onboard clean speech. This clean speech can then train a speech enhancement model, which can further refine the original dataset and thus close the iterative loop. By running two iterative rounds, we observe that enhanced output used as ground truth does not degrade model performance according to $\Delta_{PESQ}$, a metric used in this paper. We also show through comparative mean opinion score (CMOS) based subjective tests that the highest and lowest bound of refined data is perceptually better than the original data.

clean speech, curation pipeline, speech, (12 more...)

arXiv.org Artificial Intelligence

2402.12482

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

Self-supervised Pretraining for Robust Personalized Voice Activity Detection in Adverse Conditions

Bovbjerg, Holger Severin, Jensen, Jesper, Østergaard, Jan, Tan, Zheng-Hua

arXiv.org Artificial Intelligence, Jan-23-2024

In this paper, we propose the use of self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM)-encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with the goal of improving the robustness of personalized VAD. The trained models are systematically evaluated on both clean speech and speech contaminated by various types of noise at different SNR-levels and compared to a purely supervised model. Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models which are more robust to adverse conditions compared to purely supervised learning.
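
Autoregressive predictive coding trains an encoder to predict a feature frame some steps into the future from the frames seen so far, typically with an L1 loss. The sketch below computes that loss for a given set of predictions. For the denoising variant it assumes the predictions come from noisy input while the targets are the clean future frames; that reading of "denoising" is an assumption, as the abstract does not spell out the setup. The function name `apc_l1_loss` and the copy-last-frame "model" are hypothetical stand-ins for a trained LSTM encoder.

```python
def apc_l1_loss(predictions, targets, shift=1):
    """Mean absolute error between predictions[t] and targets[t + shift].

    predictions[t] is the model's guess for the frame shift steps ahead,
    made after seeing input frames 0..t.
    """
    total, count = 0.0, 0
    for t in range(len(targets) - shift):
        for p, y in zip(predictions[t], targets[t + shift]):
            total += abs(p - y)
            count += 1
    if count == 0:
        raise ValueError("sequence too short for the given shift")
    return total / count


clean = [[0.0], [1.0], [2.0], [3.0]]   # toy clean feature frames
noisy = [[0.5], [1.5], [2.5], [3.5]]   # same frames with added noise

# Stand-in "model": predict that the next frame equals the current input.
predictions = noisy

# Denoising-style objective (assumed): noisy input, clean future targets.
print(apc_l1_loss(predictions, clean, shift=1))  # → 0.5
```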

noise, utterance, vad model, (16 more...)

arXiv.org Artificial Intelligence

2312.16613

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Denmark > North Jutland > Aalborg (0.04)
Asia > South Korea > Seoul > Seoul (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
